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Abstract — In this paper, we consider recovery of jointly sparse 
multichannel signals from incomplete measurements. Several 
approaches have been developed to recover the unknown sparse 
vectors from the given observations, including thresholding, 
simultaneous orthogonal matching pursuit (SOMP), and convex 
relaxation based on a mixed matrix norm. Typically, worst-case 
analysis is carried out in order to analyze conditions under which 
the algorithms are able to recover any jointly sparse set of vectors. 
However, such an approach is not able to provide insights into 
why joint sparse recovery is superior to applying standard sparse 
reconstruction methods to each channel individually. Previous 
work considered an average case analysis of thresholding and 
SOMP by imposing a probability model on the measured signals. 
In this paper, our main focus is on analysis of convex relaxation 
techniques. In particular, we focus on the mixed £2,1 approach 
to multichannel recovery. We show that under a very mild 
condition on the sparsity and on the dictionary characteristics, 
measured for example by the coherence, the probability of 
recovery failure decays exponentially in the number of channels. 
This demonstrates that most of the time, multichannel sparse 
recovery is indeed superior to single channel methods. Our 
probability bounds are valid and meaningful even for a small 
number of signals. Using the tools we develop to analyze the 
convex relaxation method, we also tighten the previous bounds 
for thresholding and SOMP. 

Key Words: Multichannel sparse recovery, mixed-norm 
optimization, average performance, thresholding, simulta- 
neous orthogonal matching pursuit 

I. Introduction 

Recovery of sparse signals from a small number of mea- 
surements is a fundamental problem in many different signal 
processing tasks such as image denoising [8], analog-to-digital 
conversion [30], [19], [32], radar, compression, inpainting, 
and many more. The recent framework of compressed sensing 
(CS), founded in the works of Donoho [15], Candes, Romberg 
and Tao [8], studies acquisition methods as well as efficient 
computational algorithms that allow reconstruction of a sparse 
vector x from linear measurements y = Ax, where A € R nx N 
is referred to as the measurement matrix. The key observation 
is that y can be relatively short, so that n < N, and still 
contain enough information to recover x. 
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Determining the sparsest vector x consistent with the data 
y = Ax is generally an NP-hard problem [14]. To determine x 
in practice, a multitude of efficient algorithms have been pro- 
posed, [14], [18], [43], [7], [9], which achieve high recovery 
rates. The basis pursuit (BP), or £1 -minimization approach, 
is the most extensively studied recovery method [12], [8], 
[15], [35]. The use of general purpose or specialized convex 
optimization techniques [26], [18] allows for efficient recon- 
struction using this strategy. Although greedy methods, such 
as simple thresholding or orthogonal matching pursuit (OMP), 
are faster in practice, BP provides significantly better recovery 
guarantees. In particular, there exist measurement matrices 
A e M. nxN that allow for stable recovery of all fc-sparse 
vectors as long as n > Ck\og(N/k) where C is a constant. 
Such uniform recovery is not possible for simple thresholding 
or OMP [16], [36]. (We note, however, that the recent greedy 
algorithms CoSaMP [33] and ROMP [34] are able to provide 
such uniform guarantees.) In practice, the recovery rate of 
BP when averaged over all random sparse vectors is typically 
better than that predicted by the theory. This is due to the fact 
that existing analysis considers the ability of BP to recover 
all vectors x. On the other hand, in random simulations, the 
worst-case instance of x typically does not occur. Therefore, 
considering the behavior of various recovery methods over 
random x often leads to more characteristic behavior. 

The BP principle as well as greedy approaches have been 
extended to the multichannel setup where the signal consists 
of several channels with joint sparsity support [47], [45], [22], 
[13], [11], [31], [20], [21]. In [2] the buzzword distributed 
compressed sensing was coined for this setup. An alternative 
approach is to first reduce the problem to a single channel 
problem that preserves the sparsity pattern, and recover the 
signal support set; given the support, the measurements can 
be inverted to recover the input [31]. A variety of different 
recovery results have been established that provide conditions 
ensuring that the output of the proposed efficient algorithms 
coincides with the true signals. In [11] a recovery result was 
derived for a mixed l v ,\ program in which the objective is to 
minimize the sum of the ^ p -norms of the rows of the estimated 
matrix whose columns are the unknown vectors. Recovery 
results for the more general problem of block-sparsity were 
developed in [21] based on the block restricted isometry prop- 
erty (RIP), and in [20] based on mutual coherence. In practice, 
multichannel reconstruction techniques perform much better 
than recovering each channel individually. However, the the- 
oretical equivalence results predict no performance gain. The 
reason is that these results apply to all possible input signals, 
and are therefore worst-case measures. Clearly, if we input the 
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same signal to each channel, then no additional information 
on the joint support is provided from multiple measurements. 
Therefore, in this worst-case scenario there is no advantage 
for multiple channels. 

In order to capture more closely the true underlying be- 
havior of existing algorithms and observe a performance 
gain when using several channels, we consider an average- 
case analysis. In this setting, the inputs are considered to be 
random variables. The idea is to develop conditions on the 
measurement matrix A such that the inputs can be recovered 
with high probability given a certain input distribution. 

Recently, there have been several papers that consider 
sparse recovery with random ensembles. In [46] random sub- 
dictionaries of A are considered and analyzed. This allows to 
obtain average results for BP with a single input channel. In 
[40], average-case performance of single channel thresholding 
was studied. In [25], [24] extensions to two multichannel 
recovery algorithms were developed: thresholding and simulta- 
neous OMP (SOMP) [25], [24]. Under a mild condition on the 
sparsity and on the matrix A, the probability of reconstruction 
failure decays exponentially with the number of channels. In 
the present paper we contribute to this line of research by 
analyzing the average-case performance of multichannel BP, 
i.e., mixed £2.i-minimization [45], [22], [21], [20]. The tools 
we derive in this context are then also used to slightly improve 
previous bounds on average performance of multichannel 
thresholding and SOMP. 

The theoretical average-case results we develop for multi- 
channel BP are superior to the average bounds developed on 
thresholding and SOMP. For an equally mild or even milder 
condition on the sparsity and on the matrix A, we obtain faster 
exponential decay of the failure probability with respect to the 
number of channels. Thus, in this sense, the extension of BP to 
the multichannel case is superior to existing greedy algorithms, 
just as in the single channel setting. Moreover, our recovery 
results are applicable also in the single channel case whereas 
previous results [25] require a large number of channels to 
yield meaningful (i.e., positive) probability bounds (although 
our new bound for thresholding generalizing the one in [40] 
does not suffer from this drawback). Note, however, that in 
simulations SOMP often exhibits the best performance. This 
may be explained by the fact that the bounds are not tight (at 
least for SOMP). 

To develop our probability bounds, we rely on a new 
sufficient condition that ensures recovery of the exact signal 
set via £2,i-minimization. This condition generalizes a result 
of [44], [23] to the multichannel setting, and is weaker than 
existing multichannel recovery conditions. Our average-case 
analysis is then carried out assuming that the elements of 
the input signal are drawn at random. We prove that under 
a certain restriction on A and the sparsity set S, the sufficient 
condition we develop is satisfied with high probability. The 
restriction we impose is that the ^2-norm of A s ag over all £ 
not in the set S is bounded, where ag is the Ah column of 
A, and A s is the pseudo inverse of the restriction of A to the 
columns in S. This is an improvement over known worst-case 
recovery conditions which require a bound on the £i-norm 
[11], [20], and are therefore stronger. Loosely speaking, we 



will show that while worst-case results limit the sparsity level 
to order y/n, average-case analysis shows that sparsity up to 
order n may enable recovery with high probability. In terms 
of RIP constants, instead of bounding the RIP constant for 
sparsity sets of size 2k, we will only need to consider sets of 
size k + 1. 

The remaining of the paper is organized as follows. In 
Section [TT] we introduce our problem and briefly summarize 
known equivalence results between the £2,1 approach for 
multichannel recovery and NP-hard combinatorial optimiza- 
tion that recovers the true signals. A new recovery condition 
is derived in Section Hill which is weaker than previous 
results, and will be instrumental in developing our average- 
case analysis in Section [IV] Since the probability bounds 
we develop depend on the 2-norm of A s ag, in Section [V] 
we derive several upper bounds on this norm. In Section [VI] 
we use the tools developed in the previous section to derive 
new bounds on the average performance of thresholding and 
SOMP, that are tighter than existing results and also applicable 
to a broader set of problems. We then compare our bounds on 
multichannel BP to these results. Finally, in Section IVHI we 
present several simulations demonstrating the behavior of the 
different methods. 

Throughout the paper, we denote by As the submatrix of A 
consisting of the columns indexed by S C {1, . . . , N}, while 
X s is the submatrix of X consisting of the rows of X indexed 
by S. The £th column of A is denoted by ag or Ag. For a matrix 
A, \\A\\2 is the spectral norm of A, i.e., the largest singular 
value, and A* is its conjugate transpose. The unit sphere in 
R L is defined by S^ 1 = {x € M L , \\x\\ 2 = 1}; the complex 
counterpart is denoted Sjj-T 1 ={z£ C L , ||x||2 = 1}- 

II. Multichannel £1 -Minimization 
A. Problem Formulation 

We consider multichannel signal recovery where our goal 
is to recover a jointly-sparse matrix X 6 C NxL from n linear 
measurements per channel. Here N denotes the signal length 
and L the number of channels, i.e., the number of signals. 
We assume that X is jointly fc-sparse, meaning that there are 
at most k rows in the matrix X that are not identically zero. 
More formally, we define the support of the matrix X as 

L 

suppX = [J suppA^, (1) 

where the support of the Ah column is 

suppJQ = {j,A>^0}. (2) 

Our assumption is that \X\q :— | supp X\ < k. The measure- 
ments are given by 

Y = AX, YeC nxL , (3) 

where A 6 (£™ xAr j s a gi ven measurement matrix. Each 
measurement vector Yg = AXg corresponds to a measurement 
of the corresponding signal Xg. 

The natural approach to determine X given Y is to solve 
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the ^-minimization problem 



mm||X|| 



s.t. 



AX = Y. 



(4) 



However, (O is NP hard in general [14]. Several alternative 
methods have been proposed, that have polynomial complexity 
[47], [45], [22], [13], [11], [31], [20], [21], [31]. A variety of 
different equivalence results between the solution of the Iq- 
problem and the output of the proposed efficient algorithm. 
In [11] an equivalence result was derived for a mixed l Pt x 
program in which the objective is to minimize the sum of the 
£p-norms of the rows of the estimated matrix whose columns 
are the unknown vectors. The condition is based on mutual 
coherence, and turns out to be the same as that obtained from 
a single measurement problem, so that the joint sparsity pattern 
does not lead to improved recovery capabilities as judged by 
this condition. Recovery results for the more general problem 
of block-sparsity were developed in [21] based on the RIP, and 
in [20] based on mutual coherence. Reducing these results to 
the multiple measurement vectors (MMV) setting leads again 
to conditions that are the same as in the single measurement 
case. An exception is the work in [25], [24] which considers 
average-case performance of thresholding and SOMP. Under 
a mild condition on the sparsity and on the matrix A, the 
probability of reconstruction failure decays exponentially with 
the number of channels L. In Section [VTI we slightly improve 
on these bounds using the tools developed in this paper. 

In Section [IV] we follow a similar approach and treat the 
average behavior of the mixed f 2.1-minimization program [45], 
[22], [21] defined by 



min||X|| 2i i 



JV 

Ei 



subject to AX = Y, (5) 



which promotes joint sparsity, as argued for instance in [22]. 
In the single channel case L = 1 this is the usual BP principle. 
Therefore, our results can also be used to deduce the average- 
case behavior of the BP method. This is in contrast to [25], 
in which the recovery results derived are not applicable to 
the single channel case. As we discuss in Section |VI] our 
theoretical results are superior to the previous average-case 
analysis of [25] in the sense that we use an equally mild or 
even milder condition on the sparsity and on the matrix A, but 
at the same time get a faster exponential decay of the failure 
probability with respect to the number of channels L. 

B. Recovery Results 

Recovery results for the program (f5]l were considered in 
[11], [21], [20]. In particular, the lemma below is derived in 
[11] and follows also from [20] where the more general case 
of block sparsity is considered. 

Proposition 2.1: Let Sc{l,..., N} and suppose that 

pty^Hi < 1 for alii £S, (6) 

with At = (AgAs)^ 1 A* s denoting the pseudo-inverse of As- 
Then © recovers all X G C NxL with suppA" = S from 
Y = AX. 

Note, that the condition above does not depend on the number 
of channels. In the next section we will derive a condition 



similar to (|6]l that involves the 2-norm instead of the 1-norm, 
and is therefore weaker (namely, easier to satisfy). 

Assuming the columns of A are normalized, \\ag \\2 — 1, we 
can guarantee that (O holds as long as the coherence p, of A 
is small enough, where [17] 



max |(a,,-, i 



(7) 



The following result follows from [20] by noting that the block 
coherence in this setting is equal to fi/d. 
Proposition 2.2: Assume that 



(2fc- < 1. 
Then © recovers all X with \\X\\ Q < k from Y = AX. 



(8) 



Under the same conditions as in Propositions 12.11 and 12.21 it 
is shown in [43] that BP will recover a single fc-sparse vector. 
Therefore, if (0 holds, then instead of solving ([5]) we can use 
BP on each of the columns of Y. 

The coherence is lower bounded by [41] 



> 



N ■ 



n(N - 1)' 



(9) 



The lower bound behaves like 1 / y/n for large N, which limits 
the Proposition 12.21 to maximal sparsities k — O(^fn). To 
improve on this we can generalize existing recovery results 
[8], [6] based on RIP to the multichannel setup. The restricted 
isometry constant 6k of a matrix A is defined to be the smallest 
constant 5k such that 



(1 - tf fc )||a;||| < < (1 + tf fc )||a;|||, 



(10) 



for all fc-sparse vectors x. The next proposition follows from 
[21]. 

Proposition 2.3: Assume A E C nxN with o~ 2 fc < V2 - 1 



Let X e 
©. Then 



iNxL 



Y = AX, and let X be the minimizer of 



\X-X\\ F < Ck- 1/2 \\X ~ X {k) 



where C is a constant, ||X||p = y/Tr(X*X) is the Frobenius 
norm of X and X^ denotes the best fc-term approximation 
of X, i.e., swppX^ consists of the indices corresponding to 
the fc largest row norms ||X £ ||2. In particular, recovery is exact 
if | supp X\ < k. 

It is well known that Gaussian and Bernoulli random matrices 
A e R nxN satisfy 5 2 k <V2-1 with high probability as long 
as [1], [10] 



n > Ck\og(N/k). 



(11) 



For random partial Fourier matrices the respective condition is 
n > ck\og i (N) [37], [39]. Therefore, Proposition 12.31 allows 
for a smaller number of measurements. However, there is still 
no dependency on the number of channels. Indeed, under the 
same RIP condition BP will recover a single fc-sparse vector 
and therefore, as before, BP may as well be applied to each 
of the columns of Y individually. 

We conclude this overview by stressing that known equiv- 
alence results do not improve on those for single channel 
sparse recovery. In [21], [20] equivalence results are derived 
for a mixed £2.1 program when different measurement matrices 
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Ai are used on each channel. In this case, even worst-case 
analysis shows improvement over L = 1. However, when all 
measurement matrices are equal, the recovery conditions do 
not show any advantage with multiple signals. 



III. A Recovery Condition 

Before turning to analyze the average-case behavior of 
(0, we first develop a new condition on A that allows for 
perfect recovery. This formulation will be useful in deriving 
the average-case results. 

In the following theorem we give a sufficient condition on 
the minimizers of (0. This theorem generalizes a result of 
[44], [23] for the L = 1 case. To this end we denote by 



sgn(X) e 



^NxL 



the matrix with entries 



Xij 
WW: 

o, 



\x l 
\x* 



| 2 ^0; 
| 2 = 0. 



(12) 



In this definition, each element of X is normalized by the norm 
of the corresponding row. When L = 1, sgn(X) reduces to 
the sign of the elements of the vector x. 

Theorem 3.1: Let X € C NxL with suppX = S and 
assume As to be non-singular. If there exists a matrix H £ 
C nxL such that 

A* S H = sgn(X s ), (13) 

and 

\\H*a e \\ 2 <l for all l$S (14) 
then X is the unique solution of (0. 

Before proving the theorem we note that the two conditions 
on H easily imply that 



\H* 



< 1, 



for all 



Proof: The proof follows the ideas of [44], with appro- 
priate modifications to account for the mixed £2,1 norm that 
replaces the l\ norm. 

Let Y = AX, and assume there exists a matrix H such that 
X, H satisfy ( fl~3T > and ( TBI . Let X' be an alternative matrix 
satisfying Y = AX'. Our goal is to show that ||X||2.i < 
11*11 2 i- To this end, we note that 



l^lki = ll* S !ki = Tr (sgn(X s )(X s )* 



(16) 



where Tr denotes the trace. Substituting A* S H = sgn(X ) 
into ([Tol l, and using the cyclicity of the trace we have 



|X|| 2 ,i =Tr (H*A S X S ) = Tr(H*AX') 



(17) 



Tr (X' s> H*A S , 



i s ^~ = Y = AX' and S' 
denotes the support of X'. We next rely on the following 
lemma. 

Lemma 3.2: Let A, B be matrices such that AB is defined. 
Then |Tr(BA)| < ||-B|| 2 ,i maxg ||^|| 2 , with strict inequality 
if || Ai\\ 2 < max£ ||.i4.f H2 for some value of £ for which 
\\B%^0. 



Proof: The proof follows from noting that 

|Tr(^)|<^|B%|<^||i? £ ||2||^||2 

t f 



< max||Af||2^||^|| 2 =maxp 



-ih\\B e 



2,1; 



where the second inequality is a result of applying Cauchy- 
Schwartz. Under the condition of the lemma, we have strict 
inequality in the last inequality. ■ 
Applying Lemma 13.21 to ( fTTl ), leads to 

||Alki < ll*' S 'l|2,imax||i/M,|| 2 < \\X' S '\\ 2A (18) 



\X'\ 



2,1) 



where the last inequality follows from (fl~5b . We have strict 
inequality in the first inequality of ( fT~8b as long as the values 
\\H*A£\\ 2 for £ e S' are not all equal since ||X' , '||2 ^ for 
all £ £ S' be definition of the support. 

Suppose to the contrary that ||77*A£|| 2 = a for all £ £ 
S'. Clearly, 5" must contain at least one index £ that is not 
contained in S; otherwise S' C S, which would contradict the 
hypothesis that As is non-singular, AgiX' — AsX and X 7^ 
X' . By our assumption ||£f*a^||2 < L which then implies that 
a < 1 or HffM^Ha < l,t £ S'. The inequalities in ([Til) then 
become 

||*||2a < ||X' s '||2,imax||ffM,|| 2 < ||X' s '|| 2 ,i = ||X'|| 2 ,i. 

(19) 

Thus, we have shown that ||X'||2,i > ||*||2,i for any X' 
such that Y = AX', and therefore © recovers the true sparse 
matrix X. ■ 
Choosing H — (A' s )* sgn(Xs) in Theorem 13.11 results in 
the following corollary. 



Corollary 3.3: Let X £ 



iNxL 



with suppX = S and 



(25) assume As to be non-singular. If 



sgn(X s yAla e \\ 2 < 1 for all £ <£ S, 



(20) 



then X is the unique minimizer of (O. 

This corollary will be instrumental in proving the average-case 
performance of (0. It can easily be seen that Corollary [33] im- 
plies Proposition ^. II This follows from the triangle inequality, 



n(X s )*Ala e 



J2(Ai s a e ) jS gn(Xi 



<£l(4"€)jl ||sgn(^)||2 = ||4a,|| 1 , 
where we used the fact that || sgn(X : ')|| 2 = 1. 

IV. Average Case Analysis 

Intuitively, we would expect multichannel sparse recovery 
to perform better than single channel recovery. However, in 
the worst case setting this is not true as already suggested by 
the results of Section [II] The reason is very simple. If each 
channel carries the same signal, Xi — x for £ = 1, . . . , L, 
then also the components of Y = AX are all the same and 
we do not have more information on the support of X than 
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provided by a single component Yg. The following proposition 
establishes formally that if BP fails for a given measurement 
matrix A, then multichannel optimization (O will fail as well 
so that in the worst-case, adding channels will not improve 
performance. 

Proposition 4.1: Suppose there exists a fc-sparse vector x £ 
C N that ^-minimization is not able to recover from y = Ax. 
Then £2.1 -minimization fails to recover X = (x\x\ ■ ■ ■ \x) £ 
C NxL fvomY = AX. 

Proof: If £\ -recovery fails on some fc-sparse x then 
necessarily \\x'\\i < \\x\\i for some x' satisfying Ax' = Ax. 
Clearly X = (x\x\ ■ ■ ■ \x) is (jointly) fc-sparse and AX = AX' 
for X' — (x'\x\ ■ ■ ■ \x'). Furthermore, 

||X / || 2il = >/I||i'||i < VL\\x\\! = \\X\\ 2 ,i 

and therefore X is not the unique minimizer of the £2.1- 
minimization problem. ■ 

Realizing that <(5j is not more powerful than usual BP in 
the worst case, we seek an average-case analysis. This means 
that we impose a probability model on the fc-sparse X. In 
particular, as in [25], we will assume that on the support S of 
size fc the coefficients of X are chosen at random. We then 
show that under a suitable probability model on the non-zero 
elements of X, the condition given by Corollary [33] is satisfied 
with high probability, which depends on L. 

We follow the probability model used in [25]: let S be the 
joint support of cardinality fc. On S the coefficients are given 

by 

X s = (21) 

where S = diag(<jj, j £ S) £ R kxk is an arbitrary diagonal 
matrix with positive diagonal elements <Xj. The matrix $ will 
be chosen at random according to one of the following models. 

• Real Gaussian: each entry of $ £ M fcxL is chosen 
independently from a standard normal distribution. 

• Real spherical: the rows of $ £ M kxL are chosen 
independently and uniformly at random from the real 
sphere S L ~ l . 

• Complex Gaussian: the real and imaginary parts of each 
entry of $ £ <C kxL are chosen independently according 
to a standard normal distribution. 

• Complex spherical: the rows of $ £ C kxL are chosen 
independently and uniformly at random from the complex 
sphere Sq~ . 

Note that taking £ to be the identity matrix results in a 
standard Gaussian random matrix X s , while taking arbitrary 
non-zero <jj's on the diagonal of £ allows for different 
variances. The matrix £ may be deterministic or random. In 
particular, choosing £ to be the matrix with diagonal elements 
given by the inverse ^2-norm of the rows of $ in the real 
(complex) Gaussian model, leads to a matrix X s with a real 
(complex) spherical distribution. 

In Theorems 14.41 and !4.5l below we develop conditions under 
which (0 recovers X from Y = AX with probability that 
decays exponentially with L. The condition in both theorems 
is given in terms of an upper bound on || A' s ag H2 for I not in S. 
This is in contrast to the worst-case result of Proposition 12.11 
that is given in terms of || A^a^|| 1 and therefore stronger. 



The essential idea in both proofs is to show that if the 
bound on 11^4^0.^11 2 is satisfied, then the sufficient condition 
of Corollary 13.31 holds with high probability. 

Before stating the first theorem, we derive the following 
result on the norm of sums of independent random vectors, 
uniformly distributed on a sphere. 

Theorem 4.2: Let a £ C k and let Zj, j = 1, . . . , fc, be a 
sequence of independent random vectors which are uniformly 
distributed on the real sphere S L ~ 1 . Then for any u > 1 



E 

i=i 



a,j Zj 



> u\\a\\ 



< exp 
Proof: See Appendix J] 



-( w 2 -io g ( u 2 )-i; 



Theorem 14.21 generalizes the Bernstein inequality for Stein- 
haus sequences in [46, Theorem 13] to higher dimensions. We 
may extend the estimate easily to random vectors uniformly 
distributed on complex unit spheres. 



Corollary 4.3: Let a £ C and let Zj, j = 1, 



, fc, be a 



sequence of independent random vectors which are uniformly 



distributed on the complex sphere Sh l . Then for any u > 1 



E 



CLjZj 



> u\\a\\ 



that a,j £ 



< exp (-L{u 2 - \og(u 2 ) - 1)) . 
Proof: First observe that ajZj has the same distribution 
aAZi. We may therefore assume without loss of generality 
R. Next, a random vector Z £ S^ 1 is uniformly 
distributed on S^T 1 if and only if (Re(Z) T ,lm(Z) T ) T is 
uniformly distributed on the real sphere 5 2i_1 . Applying 
Theorem 14.21 with L replaced by 2L yields the statement. ■ 
With this tool at hand we can now easily prove the following 
average-case recovery theorem. 

Theorem 4.4: Let S C {1, . . . , N} be a set of cardinality fc 
and suppose 

Ps^|| 2 < a < 1 for all I $ S. (22) 

Let X £ R NxL with suppX C {1,...,N} such that the 
coefficients on S are given by (f2TT > with some diagonal matrix 
S G R kxk and $ £ R kxL chosen from the real Gaussian or 
spherical probability. Then with probability at least 



log(a- 2 ) - 1) 



(23) 



© recovers X from Y = AX. 

If the real probability model is replaced by one of the two 
complex models then L/2 can be replaced by L in d23l . 
For a < 1 we are guaranteed that the exponent in d23l has 
a negative argument, and therefore the error decays exponen- 
tially in L. 

Proof: First observe that by the rotational invariance 
of Gaussian random vectors the columns of sgn(X s )* = 
sgn($*) are independent and uniformly distributed on the 
real sphere, and the same is also true if we use the real 
spherical random model. Denote = A^ae for £ ^ S and 
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byZj,j = l, 



, k a sequence of independent random vectors 



that are uniformly distributed on the sphere S L ~ l . Using the 
sufficient recovery condition of Corollary 13. 31 the union bound 
and Theorem 14.21 we can estimate the probability that £2,1 
minimization fails to recover X by 

P(max|| sgnpf^** 



U > 1) 
sgn(X s )*6( £ )|| 2 > 1) 



< 



E 1 

e<£s 



k 

E 

i=i 



< (N - 



L 



> a 



&)exp ( ~2^ a 2 



log(a~ 2 ) - 1) 



The complex case follows analogously using Corollary 14. 31 ■ 
For L = 1, Theorem 14.41 is contained implicitly in [46, 
Theorem 13]. The appearance of the 2-norm in (124-b instead 
of the 1-norm as in © makes the condition of the theorem 
weaker than worst-case estimates (recall that ||a;||2 < ||x||i < 
Vk||;e||2 for any length-/c vector x). In Section IVl this will 
be made more evident when we consider conditions on the 
coherence fi and the RIP constant to allow for recovery with 
high probability. The requirement we obtain on \i is weaker 
than that of Proposition ^. 21 and allows for recovery with k on 
the order of n, while the worst-case results limit recovery to 
order y/n. Furthermore, in contrast to the worst-case results 
which depend on #2fc> we will show that high-probability 
recovery is possible as long as 8k+i is small enough. 

It is evident from d23l that the failure probability decays 
exponentially with growing number of channels L. Moreover, 
the bound is also useful for small L, and in particular for 
the monochannel case L = 1. Indeed, a simple algebraic 
manipulation shows that the failure probability is less than 
e provided ||4°<||2 < a for all £ £ S with a satisfying 



log (of 



> 



21og(JV/e) 



1. 



This provides a useful average-case analysis even for L = 1. 

For completeness, we also state an alternative recovery re- 
sult below which provides a slightly better probability estimate 
than Theorem 14.41 for very large values of N. However, the 
required condition on || ^4g.a£ |j 2 is stronger. 

Theorem 4.5: Let S C {1, . . . , N} be a set of cardinality 
k, and let X e R NxL be random sparse coefficients with 
supp X — S given by the real Gaussian probability model. If 

114^112 < Al 1 



for all I £ S, where 

A L = V2 



2Vfc 
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(24) 



r((£ + l)/2) 



(25) 



r(i/2) 

and r denotes the Gamma function, then with probability at 
least 

P = 1 - cxp(- J L/8) -kexp(-A 2 L /8) 
(0 recovers X from Y = AX. 



It follows from Stirling's formula T(z) 
that 



A, 



V2 



r((L + i)/2) 
r(L/2) 



V2 



/2TTzz z - 1/2 e- z , 

((£ + l)/2) L / 2 e-( L+1 >/ 2 
(L/2Y L ~ 1 )/ 2 e~ L / 2 



- 1/2 ^ff-^ 1/2 (^ + Vi) L ) 1/2 -vi. 



L(i-i)/ 

Moreover, for all L > 1 it holds that \fh > A L > \f^^L ~ 
0.797VX. 

Note that 7 = 3y /-£+ 2 y/k~ ' s monoton i ca Hy increasing in 
L. In addition, the probability P is also increasing (towards 
1) in L. Therefore, more channels increase the probability of 
success and in addition relax the requirements on the matrix 
A. 

Proof: To prove the theorem we show that if ( l24b is 
satisfied, then condition (|20b of Corollary 13.31 holds with 
probability P. 

To this end, let $ € R fcxL denote a random matrix with 
independent standard normal distributed entries, and define D 
as the k x k diagonal matrix with diagonal elements l/sj,j 6 

S, where Sj = W&W2 = \fYle=i \^jA 2 - We can then express 
sgn(X s ) = sgn(E$) = sgn(<E>) = (This equation also 
means that the diagonal matrix S does not play any role.) 
Denoting b a,j for j ^ S, 

|| sgn(X s )*6 i || 2 = W&Dbjh < \\n2\\D\\ 2 \\b 3 \\ 2 . 

By the assumption of the theorem ||6j||2 < 7 where 7 is 
defined by d24"l) . It therefore remains to bound 1 1 $ 1 1 2 and 1 1 D \ \ 2. 
^From [10, equation (4.35)], see also [42], the operator norm 
of $ satisfies 

||$||2< VZ + Vk + r (26) 



with probability at least 1 — exp(— r 2 /2) 

ve 

distributed. Therefore, denoting a x 2 (£)-variable by Y, 



Next we consider ||-D||2- Observe that the s 2 are x 2 (X) 



i[8j] = e[V7] 



2V2r(L/2) Jo 



rEx L ' 2 e x ' 2 dx 



= ^r((x + i)/2 ) = 
r(i/2) 



sfl. 



As a function of $ J the Sj are Lipschitz continuous, i.e., 
Sj(& - < \\& - ^ \\ 2 . Using these two observations 
we rely on the following standard concentration of measure 
result, see e.g. [28, eq. (2.35)] or [29, eq. (1.6)]. 

Theorem 4.6: Let / be a Lipschitz function on R L , i.e., 
\f(x) — f(y)\ < B\\x — J/H2 for all x, y € M L . Further assume 
that Z = (Z%, Z 2 , . . . , Zl) is a vector of independent standard 
Gaussian random variables. Then 

P(/(Z)>E[/(Z)]+i)<exp ' 



¥(f(Z) < E[f(Z)} -t)< exp 



2B 2 J 
2B 2 J 



Our goal is to show that ||-D||2 is bounded from above, which 
is equivalent to bounding the smallest value of Sj from below. 
Applying Theorem 14.61 to Sj, 

F( Sj <A L (1- t)) < exp(-i 2 A 2 /2), 
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where we used the fact that B = 1 and 
union bound over all j, we obtain 



Al- Using a 



( Sj <A L (l-t),Vj) = 

A^l- 
jeS 



<A L (l-t) 



t)) = kexp(-rAi/2) 



< 



Assuming that mmj^s Sj > Al(1 — t) holds, \\D 
1/(Al(1 — t)). Combining this bound with d26i > for r = \[Ls 
we have 



\sgn(X s )Ala 3 h < 



y/k + VX + s\J~L 



A L (l-t) 

(s + 1 + y/k/L)jy/L 



-1 



(l-t)A L 

Choosing s = t= 1/2, 

|| sgn(X s )*4a,|| 2 < (3 + 2y/k/L)jVL/A L < 1. (27) 

^From (|27| i and Corollary 13.31 X is recoverable using ||5). 

The probability that d27l i does not hold can be computed by 
applying a union bound to the probabilities that the spectral 
norms of each of the matrices $ and D are not bounded. 
This shows that d27l i does not hold with probability at most 
exp(— L/8) + fccxp(— A|/8) completing the proof of the 
theorem. ■ 

V. Bounded Norm Condition 

Both Theorems 14.41 and 14.51 state that X can be recovered 
with high probability from Y, as long as ||Aga«||2 is bounded. 
In this section we develop several different conditions under 
which this holds. 

Proposition 5.1: Let A e (£ nxN have unit-norm columns 
and coherence /i, and let S C {1, ...,N} be a set of 
cardinality k. Assume that 

(Vk+(k-l)6)fi<S (28) 

for some 5 > 0. Then ||A^|| 2 < 6 for all I £ S. 

Proof: Gershgorin's disk theorem implies that the small- 
est eigenvalue A m i n of A* s As is bounded from below by 
1 — (k — l)/z. In particular, A* S A$ is invertible provided 
(k - < 1. Further, 



ll^lk 



j^2\(a e ,a 3 )\ 2 < Vkfi, 

its 



since by definition, |(af, fflj)| < /J. Now, using the fact that 

4 = (A* S A S )- 1 A* S , 

\\4aeh<\\(A* s A s )-%\\A%a e h 

< (1 - (k - l)^)- 1 ^^ < 5, 



where the last inequality follows from the fact that d28t implies 
S > Vk/(1- (k-l)fi)- 1 . ' ■ 

Condition d28l is slightly weaker than (JHJ as long as 6 > 
1/y/k. This follows from the 2-norm that replaced the 1-norm 
in the upper bound. However, ( 1281 still suffers the square-root 
bottleneck k = 0(y/n). To improve on this result, we next 



provide a condition based on the following refinement of the 
RIP of A. For a set S C {1, . . . , N} we let 

5(S) = \\A* S A S - I\\ 2 - 

The restricted isometry constant 6k of (TToT > satisfies 5k = 
max|5|< fc DA^Ag — J||2 so that if S has cardinality k then 
S(S) < S k . We further define 



6*(S) = max6(SU{£}). 



(29) 



Clearly, S(S) < S*(S) < &k+i- Finally, we make use of the 
following "local" 2-coherence function, 



02 (S) — max max || ^4_s a ^ll 2 j max \\A* S 



(30) 



for a subset S C {1,...,N}, where S \ £ denotes the 
elements in S excluding the Ah one. From the definition of 
the coherence it follows immediately that 

(31) 

,ctj)\ of the vector 



Ma OS) < VISIm, 

since the magnitude of each element \(ag 
A* s ai is bounded above by /i. In addition, 

M(S) <S*(S). 



(32) 



This is a result of the fact that A* s ae is a submatrix of 
A*su{i}Asu{t} — I f° r ^ ^ S, while Ag^r^ae is a submatrix 
of A* S A$ — I for £ E S. (They both consist of a subcolumn of 
the respective matrix, that "leaves" out the diagonal element.) 
We now use these definitions to bound ||AgG^||2: 
Proposition 5.2: Let S C {1, . . . , N}. Then: 

(a) If A satisfies S*(S) <6<l/2 then 

\\Alai\\ 2 < < 1 for all ^^5. 

(b) If A satisfies 5(S) < 8 < 1 and /i 2 (S) < rj then 



i4< 



< 



V 



1 



Proof: Denoting by A an eigenvalue of A* s As, the 
definition of 5(S) < S*(S) < 5 implies that |1 - A| < 8. 
Consequently, the smallest eigenvalue of AJAg is bounded 
from below by 1 — 8 and therefore 

1 



Us As 



< 



1-5 



For (a), as already noted above, A* s ae for £ ^ S is a 
x 1 submatrix of A^ ue ATut — I- Therefore, 1 1 ^4^a^ | j 2 < 
At ui A T u£ - Ih < 8, and 

||A^|| 2 < \\(A%A s )~ l A%at\\i 

< WiA'sAsrHM ' 



< 



1 



The proof of (b) follows from the fact that ||Aca^||2 < ^(S). 
A similar estimate as above yields ||4 a .«l|2 — (1 ~ S)^ 1 ^. ■ 
Proposition 15.21 applies if 6k+i is small while in contrast 
Theorem 12.31 works with 5 2 fc, which is generally larger than 
8h+i- By (fTTT i the condition 6k+i < S can be satisfied if n > 
Cgklog(N/k). Working with 5*(S) instead of 6k+i allows to 
improve on the bound ( fTTT i for Gaussian, Bernoulli and random 
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spherical matrices. 

Proposition 5.3: Let 5 c {1, . . . , N} be a set of cardinality 
k and suppose that A = ^$ G R" xAr , where $ is drawn 
at random according to a standard Gaussian or Bernoulli 
distribution (with expectation and variance 1/n). Then 
5* (S) < S with probability at least 1 — e provided that 



n > Ci<T 2 max{Mog(l/(5),log(iV/e)} 



(33) 



for a suitable constant. 

The same statement holds (with possibly a different con- 
stant) for a random matrix whose columns are chosen inde- 
pendently at random according to the uniform distribution on 
a sphere. 

Proof: See Appendix Ull ■ 

A straightforward extension of the proof, as in [1], also shows 
that a random matrix A G M nxJV with independent columns 
drawn from the uniform distribution on the sphere satisfies 
RIP, Sk < 5 with probability at least 1 — e provided n > 
C5~ 2 (fclog(7V I k) +log(e~ 1 )). Although this fact seems to be 
known, we are not aware of reference where this is rigorously 
stated. 

The next result relies on a theorem by Tropp [46, Theorem 
B] that uses random support sets S and allows to work with 
the coherence /i alone. Note that choosing S at random is 
perfectly in line with an average-case analysis. 

Theorem 5.4: Let A G C nxN have unit norm columns and 
coherence p. Let S C {1, . . . , N} be a set of cardinality k > 4 
chosen uniformly at random. Let S,e £ (0, 1) and assume that 

ffklogie- 1 ) < cS 2 , 



±\\A\\ 



< 



where c 



-1/2 



log(2)e 
4-144 log(3) 

A\at\\ 2 < 



- 4e l/4' 

6.64 • 1(T 4 . Then 



(34) 
(35) 



for all I 4 S 



(l-^vioiF 1 ) 

with probability at least 1 — e. 

Proof: The proof relies on [46, Theorem 12]. The 
formulation below follows from [46] by setting s = 
log(e" 1 )/log(fc/2) and estimating log(fc/2 + l)/log(fc/2) < 
log(3)/log(2) for k > 4. 

Theorem 5.5: Assume A G C nxN has unit norm columns 
and coherence p. Let S C { 1 , . . . , N} be a set of cardinality 
k > 4 chosen uniformly at random. The condition 

k 



: \\A\\l<e-VH 



(36) 



v /l441og(3)log(2)-V 2 fclog(e- 1 ) 
implies 

P(||A^ s -/||>5)<e. 
Using ( |34l > and the value of c, the square-root in ( |36*l l becomes 
<5/(2e 1 / 4 ). Combining this with d35l > shows that d36*] > is satis- 
fied. Therefore, HAgAs — X|| 2 < S with probability at least 
1 — e, which implies that 

1 



Finally, 

Il4^lla < \\(AsAs)h\\A* s a e \\ 2 < j^^L 

~ (l-^VToilF 1 ) 

by using condition (1341 once more. 



A. Comparison With Worst-Case Results 

Our average-case analysis depends on ||4 a ^ll2, while the 
classical condition © of Proposition 12 . 1 1 depends on ||Ag a ^l|i 
and is therefore significantly stronger. Proposition 15.21 estab- 
lishes that the 2-norm condition can be satisfied as long as 
Sk+i < 1/2. This is clearly weaker than the worst case 
condition 62k < V% — 1 ~ 0.41 of Proposition 12.31 

Let us now compare worst-case and average results based 
on the coherence \i, by relying on Theorem l5.4l For simplicity, 
we consider the case in which A is a unit-norm tight frame, 
for which = In this case, ( |35l ) is equivalent to k < 



If additionally p = 



then conditions (f34t and 



< 



1-5 



d35l l are both satisfied for fixed e, 6 provided 

k < C'n. 

This beats the square-root bottleneck and even removes the 
log-factor present in estimates for the restricted isometry con- 
stants, see dTTT >. Moreover, we have the additional advantage 
that the coherence is much easier to estimate than the restricted 
isometry constants. 

Combining Theorem 15.41 with the average-case analysis of 
Theorems 14.41 and 14.51 shows that for a unit norm tight frame 
A of coherence /i multichannel sparse recovery by (0 can be 
ensured in the average-case provided k < CyT 2 , which can be 
as small as k < Cn. Moreover, the failure probability decays 
exponentially in the number of channels. 

In the next sections we provide further examples when we 
discuss particular choices of the matrix A. 

VI. Comparison with Multichannel Greedy 
Algorithms 

We now compare our results regarding £2,1 optimization 
to those obtained for the greedy algorithms p-thresholding 
and p-SOMP [25]. These are multichannel versions of simple 
thresholding and orthogonal matching pursuit. For 1 < p < 00 
they produce a fc-sparse signal X from measurements Y = 
AX using a greedy search. To this end, we improve slightly 
on previous average-case performance results in [25] for these 
algorithms in the noiseless setting. 

A. Greedy Methods 

In p-thresholding, we select a set S of k indices whose 
p-correlation with Y are among the k largest: 

Ky|| p > ||a*y|| p , w g s,vj $ s. (37) 

After the support S is determined, the non-zero coefficients of 
X are computed via an orthogonal projection: X s = A' S Y. 
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The p-SOMP algorithm is an iterative procedure. At each 
iteration, an atom index £ m is selected, and a residual is 
updated. At the first iteration the residual is simply Yq = Y. 
After M iterations, the set of selected atoms being Sm = 
{£ m }m=i> me new residual is computed as Ym = Y — 



Now, 



A Sm X m = {I - Ps M )Y where X M = A\ m Y and P Sm = 
As M Ag is the orthogonal projection onto the linear span of 
the selected atoms. The next selected atom fc/w+i is the one 
which maximizes the p-correlation with the residual Ym, 



\a*t Y] 



M\\p 



max acYj 

KKN 



M\\p- 



(38) 



Using the probability model (fJTJ average-case recovery 
theorems for p-thresholding and p-SOMP have been proven 
in [25], [24, Theorems 4,6,7,8]. We improve slightly on these 
in the following. (Note, however, that [25] also treats the 
noisy case.) Our first result generalizes the one in [40] to the 
multichannel setup. 

Theorem 6.1: Let A S C nxN have unit norm columns and 
local 2-coherence function P2{S) defined in d30t . Let X S 
R NxL with suppX C S where S C {1, . . . , TV}, and such 
that the coefficients on S are given by dH), X s = S$, where 
we choose the real spherical model for <£>. Set Y — AX and 
R = maxj <7j / min j <r,j . If 



e = RMS) < !> 



(39) 



then the probability that 2-thresholding applied to Y fails to 
recover X is bounded by 

iVexp (-L/2(0- 2 - log(<T 2 ) - 1)) . 

If we use the complex spherical model instead of the real 
spherical model then L/2 in the above probability estimate 
may be replaced by L. 



The probability bound of Theorem 16.11 is similar to that 
of Theorem 14.41 However, in contrast to our results for £2,1- 
minimization, success of thresholding suffers a dependency 
on the diagonal matrix E. The larger the ratio R, the stronger 
the condition d39l on the maximal allowed sparsity k, and the 
larger the probability of error. 

Proof: We proceed similarly as in [40]. We denote by 
the event that 2-thresholding fails. Clearly, 

P(6) = P(min||a*y|| 2 < max ||a?y|| 2 ) 

ies 14s 

< P(min||a*Y|| 2 < p) + P(max ||a|Y|| 2 > p), 

where p will be specified later. Denote by Zj, j 6 S, a 
sequence of independent random vectors which are uniformly 
distributed on the unit sphere of M. L . Then, 



»(Dun||ajy|| 2 <P) = 



mm 

ies 



< P 



E 



^a*aj(TjZj 



> Wn 



o-jZ* 



E 



a*ajOjZ* 



Substituting into (l40l . 



Pfmin 1 1 a' 

ies 



Yh< P ) 



^E 1 



E r! j"'-"j / j 



> o" min - p 



Choosing p = cr m i n /2 and applying Theorem 14.21 we obtain 

P(mm||a*r|| 2 <p) 

< fcexp(-L/2(6»- 2 - log((T 2 ) - 1)) 

where we used the definition of 9 and P2{S). Similarly we 
estimate 

P(max||a*r|| 2 > a min /2) 

<(N -k) exp(-i/2(6»- 2 - log(6»- 2 ) - 1)). 

Combining the two estimates completes the proof for the real 
case. Choosing the vectors Zj, j € S, from the complex unit 
sphere Sk and using Corollary 14.31 yields the statement for 
the complex case. ■ 

We now state the corresponding result for 2-SOMP, which 
slightly improves the one in [25] for the noiseless case. (Note 
that we restrict to p = 2 here, although the theorem is easily 
extended to general values of p.) 

Theorem 6.2: Let A be a matrix with unit norm columns 
and constants 5(S) 7 p 2 (S) < 1 where S C {1, ...,N}. 
Assume that 



/l2 (5) 2 + (l+ £ )(l- e )-V 2 (S) 



< 1 



(41) 



(40) 



l-S(S) 

for some e 6 (0,1). Let X be a random coefficient matrix 
with support S that is selected according to the real Gaussian 
probability model, see (ED, and let Y = AX. Then 2-SOMP 
applied to Y recovers X in k steps with probability at least 

l-iV2 fc exp(-e 2 A!), (42) 

where Al ~ \[L is given by (l25l l. 

If we use the complex Gaussian model instead of the 

real Gaussian model then the same conclusion holds with L 

replaced by 2L in d42l . 

Proof: See Appendix Hill ■ 
Remark 6.3: (a) Due to the factor 2 k the probability 
bound d42l becomes effective only when the number 
of channels becomes comparable to the sparsity k. This 
drawback is very likely due to the analysis and is not 
observed in practice. However, it seems to be very 
difficult to remove this factor by a more sophisticated 
proof technique. 
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(b) We require e < 1, so that the probability decay of (142) 
is potentially slower than that given by Theorem 14.41 

(c) With 5 = e=l/2 condition (HD is satisfied if /x 2 (A) < 
1/7 while the probability estimate d42l behaves like 1 — 
7V2 fc exp(-L/4). 

(d) With the estimates S(S) < S*(S) and (i 2 (S) < 6*(S), 
( |4TT i with e = 3/11 is implied by 

< 1/3. 

(e) By Proposition 15.21 the condition 6*(S) < 1/3 implies 
||^S^||2 < 1/2 for all £ ^ S, i.e., the bounded norm 
condition ( l22l of the average case recovery result for 
mixed £2,1- In other words, the condition in (d) for SOMP 
is slightly stronger than the one for £2,1- 

B. Comparison 

We now compare the average-case recovery conditions for 
mixed £2,1, thresholding and SOMP for the following choices 
of the matrix A which we will also use in the numerical 
experiments: 

1) Random spherical ensemble; 

2) Union of Dirac and Fourier; 

3) Time-Frequency shifts of the Alltop window. 

1) Random spherical ensemble: Assume that the random 
columns of A S M. nxN are independent and uniformly 
distributed on the sphere S 1 " -1 . Let S be a support set of size 
fc. Then according to Propo sition 15 . 2 1 the condition 1 1 ^4^a£ 1 1 2 < 
a < 1 of Theorem g3] is implied by S*(S) < xfe < 1/2, 
while by Proposition 15.31 the latter holds with probability at 
least 1 — e provided 

n > max{Ci(a)fc, C 2 {a) k>g(N/e)}. (43) 

Assuming, for example, a = 1/4, under the probability 
model (f2TT >. the probability that reconstruction by ^2,1 fails 
is bounded from above by N exp(— L/2(15 — log(16))) + e = 
N exp(-cL) + e with c w 6.1137. 

We now compare this result with the condition of Theorem 
[64] concerning thresholding. As noted in ( T32l . p. 2 (S) < S* (S). 
Therefore, by Proposition 15.31 we have 

6 = 2Rp, 2 (S) < 2R5*(S) < 1 

with probability at least 1 — e provided 

R 2 

n > C— max {fc log(fl/0), log(A/e)} (44) 

and the failure probability of thresholding is bounded by 
N exp(- J L/2(6'- 2 - log(6>- 2 ) - 1)) + e. 

Let us finally consider Theorem l6.2l for SOMP. By Proposi- 
tion [53] the condition 6*(S) < 1/3 in Remark l6~3l is satisfied 
with probability at least 1 — e provided 

n > max{Cifc, C 2 log(A/e)} (45) 

and the failure probability of SOMP is bounded by 

N2 k exp(-9/121 A\) + e (46) 

with A\ ~ L if the real Gaussian probability model is used. 



Conditions g3), (gUl, (05) for £ 2 ,i, thresholding and SOMP 
are rather similar. However, condition d44l > for thresholding 
involves the ratio R. If R is large then thresholding behaves 
much worse compared to £2.1 and SOMP. The probability 
estimate d46*i i is the worst compared to the other two algorithms 
due to the factor 2 k . Therefore, £ 2t i gives the best known 
theoretical average case result. 

2) Union of Dirac and Fourier: Consider the nx2n matrix 
A = (I\F), where / is the n x n identity matrix and F is 
the normalized n x n Fourier matrix. The coherence of A is 
easily seen to be fi = 1/^/n. By Proposition 15.11 condition 
d22l . ||Ata^||2 < ct with a = 1/2 is satisfied for all support 
sets S of cardinality at most k provided 

fk fc- 1 1 
V n + 2^h < 2' 

If S is chosen at random then a much better bound (up to 
constants) is obtained using Theorem 15.41 In our special case, 
however, further improvement is possible. A reformulation 
of a result of [5], see also [46, Proposition 3] shows the 
following. If the support S consists of k\ arbitrary elements of 
{1, . . . ,n} and fc2 random elements of {n + 1, . . . , 2n} then 
with probability at least 1 — e we have 5(S) < 1/2 provided 

cn 

k = fci + fc 2 < (47) 

V lo s(cfe) +lo s( n ) 

with c = 0.25. In particular fc < n/4 and the same reasoning 
as in the proof of Theorem 15.41 yields 

HA^Ha ^ a = V 2 - 

Using one of the complex probability models in Theorem 
14.41 the failure probability of ^.l-minimization is bounded by 
Aexp(-L(4- log(4) - 1)) = Aexp(-cL) with c w 1.61. 

To compute the performance of thresholding, note that 
condition 05), 2Rp, 2 {S) < 2RfiVk < 6 < 1, is satisfied 
provided 

4R 2 

n > -p-k. (48) 

Assuming that the non-zero rows of the matrix $ in the 
probability model d2Tb on the coefficients are indepen- 
dent and uniformly distributed on the complex unit sphere 
S^ -1 , the failure probability of thresholding is bounded by 

N exp(- J L(6»- 2 - log(6»- 2 ) - 1)). 

Assuming 6(S) < S = 1/2 and [i\fk < 1/7, i.e., 

n > 49 fc, (49) 

the condition of Remark |6.3f c) is satisfied since by d32| ), 
p, 2 (S) < fxVk < 1/7. Then by Theorem [6721 SOMP fails with 
probability at most N2 k exp(— A 2 L /A) assuming the complex 
Gaussian probability model. Assuming as in the discussion 
of £2,1 that the support set is such that fci arbitrary elements 
of {1, . . . , n} and fc2 random elements of {n + 1, . . . , 2n} 
are chosen with fc = k\ + k 2 then the assumed condition 
S(S) < 1/2 is true with probability at least 1 — e provided 
g7) holds. 

Similar conclusions on the comparison of the three algo- 
rithms as in the previous example apply. We note, however, 
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that in contrast to £2,1 and SOMP, the performance bound 
for thresholding does not require a probability model on the 
support set S. 

3) Time -Frequency shifts of All top window: Let n > 5 be 
a prime. Denote by (T r g)( — g^ r mo d n and (M s g)e — 
e 2-msi/ng^ tne cyclic shift and modulation operator, respec- 
tively. Then T r M s , r, s = 0, . . . , n — 1 forms the set of time- 



frequency shifts. Let gi 



i 2mi /" be the so-called Alltop 



window. Then define A to be the nx n 2 matrix with columns 
being the time-frequency shifts T r M s g, r,s = 0, . . . , n — 1. 
The coherence of A is /i = l/v^ [41]- 

As in the Fourier-Dirac case, under condition d48b and the 
complex probability model of Theorem 16. II thresholding fails 
with probability at most N exp{-L(9^ 2 _ l g(6l~ 2 ) - 1)). 

For the analysis of I2.1 and SOMP we assume that the 
support £ is chosen uniformly at random. As A is the union 
of n orthonormal bases we have \\A\\^ = n. Then choosing 
5 = 3/4 in Theorem 15.41 yields that under the condition 

n > CfclogO" 1 ) 

with a constant C (which also implies d35l l) we have 

HA^|| 2 < S^log- 1 / 2 ^" 1 ) < a for all I S 

with probability at least 1 — e where a = w 0.0773. By 
Theorem |4.4| using one of the complex probability models, the 
failure probability of £ 2 ,i is then bounded by N exp(— c 2 L)+e 
with c 2 = or 2 — log(a~ 2 ) — 1. 

For the analysis of SOMP we choose 6 = 1/2 in Theo- 
rem 15.51 Assuming that the square-root in (f36l > is less than 



in 



2 is equivalent to 

n > Cklogie' 1 ) 



(50) 



with an appropriate C, and condition ( I36l l is satisfied. Then 
with probability at least 1 - e we have S*(S) < 1/2. 
Furthermore, as suggested by Remark I6.3f b) the condition 
^{S) < 1/12 is also implied by d50t since ^(S) < Vkfi — 
Assuming the complex Gaussian probability model on 
the non-zero coefficients of X the failure probability of SOMP 
is bounded by N2 k exp(-A^ L /2) + e due to Theorem l6\2l 

VII. Numerical Simulations 

We tested the three algorithms £2.1 minimization, thresh- 
olding and SOMP using the three different types of matrices 
indicated in the previous section. The support set S of the 
sparse coefficient matrices X was always selected uniformly 
at random while the non-zero coefficients were selected at 
random using one of the following choices of the probability 
model (E), X s = £$: 

1) $ is chosen to be a real Gaussian random matrix (i.e., all 
entries independent and standard normally distributed); 
S has independent diagonal entries with standard normal 
distribution. 

2) $ is chosen to be a complex Gaussian random matrix 
(i.e., the real and imaginary parts of each entry are 
chosen independently according to a standard normal 
distribution); S is equal to the identity. 



Note that S = / is favorable for thresholding, while the choice 
of E should have no influence on the performance of £2.1 and 
only a mild influence on SOMP. 

In the following figures the results of various simulation 
runs are plotted (we always used 100 simulations for each 
choice of parameters). 

In Fig. Q] we plot the results when choosing A from a 
random spherical ensemble of size n = 32 columns and 
N = 256 rows for L = 1,2,4. The matrix X was generated 
according to model (1). The improvement with increasing L 
is clearly evident. 
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Fig. 1. Multichannel recovery with X generated according to model (1) 
and A chosen from a random spherical ensemble, (a) ^2,1, (b) SOMP, (c) 
Thresholding. 
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In Fig. [2] we consider all three methods when A is a union 
of Dirac and Fourier bases, each with 32 elements. Therefore, 
n = 32 and N = 64. The matrix X was generated according 
to model (2). In this setting the performance using thresholding 
is reasonable, though still worse than £2.1 and SOMP. 
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Fig. 2. Multichannel recovery with X generated according to model (2) 
and A a union of the Dirac and Fourier bases, (a) £2,1, (b) SOMP, (c) 
Thresholding. 

Finally, in Fig. [3] we plot the results when using time- 
frequency shifts of the Alltop window with n — 29 and N = 
29 2 = 841. Here the results of thresholding are extremely poor 
and therefore not plotted. 

In all three cases, SOMP performs better than the £2,1 
approach. However, both show clear performance advantage 
with increasing L. 
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Fig. 3. Multichannel recovery with X generated according to model (2) and 
A chosen as time-frequency shifts of the Alltop function (a) ^2,1 (b) SOMP. 



VIII. Conclusion 

In this paper we analyzed the average-case performance 
of £2,1 recovery of multichannel signals. Our main result is 
that under mild conditions on the sparsity and measurement 
matrix, the probability of failure decays exponentially with 
the number of channels. To develop this result we assumed 
a probability model on the non-zero coefficients of a jointly 
sparse signal. The results we obtained appear to be the best- 
known theoretical results on multichannel recovery. Using 
the tools we developed for analyzing the ^2,1 approach, we 
also improved slightly on previous performance bounds for 
thresholding and SOMP. 



Appendix I 
Proof of Theorem I4.2I 

The proof uses the following extension of Khintchine's 
inequality to higher dimensions stated in [27], 



< - 



p/2 r (£±ej 



for all p > 2 and all vectors a€l .By splitting in real and 
imaginary parts it easily follows that this inequality also holds 
for all a € C fc . We may assume without loss of generality that 



13 



; | ci 1 1 2 = 1- Then an application of Markov's inequality yields 



> u 



I 


f 




*\ \ 


exp 


XL/2 


^ a 3 Z 3 


> exp(XLu 2 /2) 


V 




3 


J ) 











< exp(-ALu 2 /2)E 


cxp ^XL/2 


3 


31 


OC 




2 

k 


i 



= exp(-ALu 2 /2) ^{XL/2) 1 



i=0 

oo 



^Z a 3 Z 3 

3 = 1 



< exp(-Aiu 2 /2) x ' 



i=0 

oo 



T(L/2 + i) 
i\T(L/2) 



exp(-Ai U 2 /2)V^^A 4 



i=0 



exp(-ALu 2 /2)- 



(51) 



(l-A)Va' 

where (a)j = a(a + l)(a + 2) • •• (a + i — 1) denotes the 
Pochhammer symbol. The last equation is due to the fact that 
X^o ^ir^ 1 i s me Taylor series of (1 — A)~ a , which converges 
for A < 1. Minimizing ( BP with respect to A gives A = 1 — 
vT 2 . Inserting this value yields the statement of the theorem. 

Appendix II 
Proof of Proposition [53J 

Consider first the case of Gaussian or Bernoulli matrices. 
According to Theorem 2.1 in [38] (see also Lemma 5.1 in [1]), 
we have ||AgA5 — J|| 2 > S with probability at most 2(1 + 
12/S) k exp(— c /9n<5 2 ) with c = 7/18. A similar estimate 
holds for H-Agu^-Asuf — /|| 2 with £ ^ S. A union bound over 
all I £ S yields 8*{S) > 5 with probability at most 2N(l + 
12/5) k cxp(-c /9n<5 2 ). This term is less than e if 05) holds. 

Now consider a random matrix "$! € M" xAr with indepen- 
dent columns that are uniformly distributed on the sphere 
S n ~ 1 . Then has the same distribution as DA, where A 
is Gaussian matrix as above, D = diag(s^ 1 , . . . , s^ 1 ) and 
sj = \/n.\\$j\\2 where <5j G R™ is a vector of independent 
standard normally-distributed random variables. We now use 
the following measure concentration inequality [3, Corollary 
(2.3)] or [4, eq. (2.6)] for a standard Gaussian vector Z € R ra , 

P(||Z||I > -^-) < exp(- 7 V4), 
1-7 

< (1-7)") < exp(- 7 2 n/4). 
By a union bound this implies that 

1 — 7 < min s 2 < max s 2 < 

j'=l,...,JV J j=l,...,N J 1 — 7 



> 1 - 27Vexp(-7 2 n/4). 



(52) 



By the above reasoning, we have (1 — <5/3)||x||| < 1 1 ^4a? 1 1 2 < 
(1 + <5/3)||x||2 for all x with suppx C S U {£} for some 



t ^ S with probability at least 1 — e provided ( 155) holds with 
a suitable constant. If additionally 1 — 7 < mirij =1 jf s 2 < 



< j±- for 7 = (5/4 then (1 



< 



\\DAx\\l = ll^xlH < (1 + 5)\\x\\l for all x with suppa; c 
S U {^} for some £ ^ 5. By a union bound and d52) this 
holds with probability at least 1 — 2e provided d33) holds and 
27Vexp(— (5 2 n/64) < e, the latter being equivalent to n > 
64<5 2 log(2AT/e). Adjusting the constant in (133) completes the 
proof. 



Appendix III 
Proof of Theorem IO] 



We assume that until a certain step SOMP has selected only 
correct indices, collected in J C S. Let us first estimate the 
probability that it selects a correct element of S \ J also in 
the next step. 

We denote by Pj = AjA^j the orthogonal projection onto 
the span of the columns of A in J, and Qj = I — Pj, The 
residual at the current iteration is given by Ym — QjY = 
QjA s X = SOMP selects a correct index in S\ J 

in the next step if 

max \\alQjA s j:^\\2 > max ||af QjA s £$|| 2 . (53) 



By Theorem 1 1 in [25] (which is proven using Theorem 14.61 
note that there is a slight error in [25] in the computation 
of the constant Al) we have the following concentration of 
measure inequalities 

max ||a|QjA 5 S$|| 2 < (l + e)C 2 (L)x 
ees\j 

x max \\a* e QjAsT,\\ 2 ) < exp(-e 2 ,4|), 
ees\j J 

P ( max || a* e QjA s ^ || 2 > (l-e)C 2 (L)x 

xmaxKQjA 5 £|| 2 ) < \S C \ exp(-e 2 A|), 

where Al is the constant in d25) and C 2 (i) = E||Z|| 2 with 
Z = (Zi, . . . , Zl) being a vector of independent standard 
normal variables. Now we assume that 



(l + e)C 2 (L) max K*Qj,4 s £|| 2 
> (l-e)C7 2 (i)max||a|QjAsS||2. 



(54) 



Then by the above and a union bound the probability that 
SOMP fails can be bounded by 



"(max \\a* e QjAsT,$\\2 < max ||a^Q, / A s i; < I , || 2 ) 

ies\j t$s 

<(|5 c | + l)exp(- e X). 



(55) 



Let us consider now the maximum on the right hand side of 
First note that Pjai = cie for all £ E J, in other words 
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)jai = 0. Hence, we can estimate 

TOBx.\\atQjA s ml = max\\Vg\jA* s \jQja e \\l 

<max rf\(Qj a j' a i)\ 2 



j£S\J 



< max Uj 2 max Y^ |(Qj%'j^)| 2 - 
jeS\J 

Furthermore, for £ ^ S we have 

v 1/2 

X] KQj-Oj)^)! 2 1 =\\A*s\jQja t \\ 2 

= \\A* SV (I - Pj)a e \\ 2 

< \\A*s\jath + \\ A *s\.j A j( A *J A Jy lA >ih 

< m {S \ J) + U^jAjhMA'jAj^hU'jaeh 



l-5(S)^ y ~ J l-S(S) 7 

where we used the fact that Ag^jAj is a submatrix of A* s As — 
I. 

Next we consider the maximum on the left hand side of 
(|54| |. We can estimate 

max \\a* e QjA s ml = max Y^ a 2 \(Qja e , a 3 )\ 2 
ieS\J ees\.i ^ . 3 



j£S\J 



Furthermore, for j E S \ J 

\(Qjaj,aj) \ = \ -Pj)aj,ctj)\ 

= \l-a*Aj(A}A J )- 1 A}a J \ 

yi-WAya^UAyAj)- 1 ]], 

>l-^(Sf(l-5(S))-\ 
Combining the above estimates, condition (l54l l is satisfied if 

(l + «)^>(l-«)(l- ^ 



1 - 5(5) V 1 - <*(S) / ' 

which is equivalent to d4TT >. 

In order to complete the proof, we note that OMP success- 
fully recovers the correct signal if d54l i holds for all J C S. 
By a union bound of d55l > over all those 2 k subsets this is 
true with probability at least 1 — N2 k exp(— e 2 Aj j ) provided 
condition d4lT i holds. 

The extension to the complex valued case is straightforward. 
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